Skip to content

feat(schema): add resourceType dimension to events and gauges (column + index + projection)#17

Closed
lohanidamodar wants to merge 1 commit into
mainfrom
CLO-4383-resourceType-dimension
Closed

feat(schema): add resourceType dimension to events and gauges (column + index + projection)#17
lohanidamodar wants to merge 1 commit into
mainfrom
CLO-4383-resourceType-dimension

Conversation

@lohanidamodar

Copy link
Copy Markdown
Contributor

Summary

Adds resourceType as a first-class dimension column on the events and gauges tables, with a data-skipping index and dedicated per-dim projections so that grouped/filtered reads on resourceType route to a projection instead of a base-table scan.

What resourceType is and why

resourceType is the API-level plural family the metric belongs to: 'functions', 'sites', 'buckets', 'databases', etc. It is distinct from the existing singular resource tag, which is the row's own type — 'deployment', 'function', 'bucket', 'collection', 'file'.

Cloud's StatsResources and StatsUsage already use the plural form everywhere in metric names (METRIC_RESOURCE_TYPE_DEPLOYMENTS, METRIC_RESOURCE_TYPE_BUILDS_STORAGE, etc.) — making it a first-class column lets callers slice "storage by resourceType" or "events by resourceType" without parsing metric names.

Library changes

Metric.php

  • EVENT_COLUMNS and GAUGE_COLUMNS gain 'resourceType' so it round-trips through extractColumns() during write.
  • getEventSchema() / getGaugeSchema() add a string(64) column. (64 is plenty — current values are 8-12 chars; the cap is documentation, not a real ceiling.)
  • getEventIndexes() / getGaugeIndexes() index resourceType with set(0) since cardinality is bounded (~10-20 distinct values across a project lifetime). set outperforms bloom_filter for low-cardinality columns with selective WHERE resourceType = 'X'.
  • getResourceType() accessor.

ClickHouse.php

  • EVENT_PROJECTIONS gains p_by_resourceType — grouped event aggregations on resourceType will use a SUM projection.
  • GAUGE_PROJECTIONS gains:
    • p_by_resourceType — single-dim, for "latest snapshot per resourceType" via argMax(value, time).
    • p_by_resourceType_resource — covers the common "storage breakdown by resourceType × resource" panel in one projection scan.
  • ensureGaugeDimColumns() extended with resourceType; new ensureEventDimColumns() does the equivalent migration on existing events tables. Both invoked from setup() before projections are added — projections that reference resourceType cannot materialize until the source column exists on the base table.

Performance

Query shape Before After
SELECT … FROM events WHERE resourceType = 'functions' GROUP BY metric full-scan + filter p_by_resourceType SUM projection
SELECT … FROM gauges WHERE resourceType = 'sites' GROUP BY resource full-scan + filter + argMax p_by_resourceType_resource projection
SELECT resourceType, sum(value) FROM events GROUP BY resourceType full-scan + GROUP BY p_by_resourceType projection

For very selective WHERE resourceType = 'X' queries without GROUP BY, the set(0) skip-index prunes parts at the granule level before the read.

Compatibility

  • Additive change. EVENT_COLUMNS / GAUGE_COLUMNS gain one entry; extractColumns() is strict and rejects unknown keys, so callers that already pass resourceType in tags begin populating the column immediately. Callers that don't are unchanged.
  • Existing tables migrate on next setup() call via ALTER TABLE ADD COLUMN IF NOT EXISTS. Existing rows have NULL until publishers backfill or new writes land.
  • Projections are added idempotently via the existing addProjection() helper — re-running setup() is safe.

Test plan

  • composer check clean (PHPStan max + baseline)
  • composer lint clean (Pint)
  • composer test against ClickHouse — confirms setup() creates the column, new projections materialize, and addBatch/find round-trip the field through the tag map
  • Manual: confirm SELECT * FROM system.projection_parts WHERE table = 'gauges' AND name = 'p_by_resourceType' after setup() shows parts
  • Manual: EXPLAIN PLAN indexes=1 SELECT … WHERE resourceType = 'X' GROUP BY metric shows the projection in the query plan

Follow-up

A companion PR on appwrite-labs/cloud will:

  • Add 'resourceType' to VALID_DIMENSIONS on /v1/usage/gauges and /v1/usage/events
  • Populate the tag from StatsResources (per-bucket → resourceType='buckets', per-function → 'functions', per-site → 'sites', per-collection/database → 'databases') and from StatsUsage (via inferResourceTypeFromMetric mirroring the existing inferServiceFromMetric)
  • Add resourceType field to UsageDataPoint response model

That lands once this library PR ships and is tagged.

resourceType is the API-level plural resource family the metric belongs
to ('functions', 'sites', 'buckets', 'databases', …) — distinct from
the existing singular `resource` tag (the row's own type: 'deployment',
'function', 'bucket'). Cloud's StatsResources and StatsUsage use this
plural form everywhere in metric names (METRIC_RESOURCE_TYPE_*); making
it a first-class column lets callers slice 'storage by resourceType'
without parsing metric names.

Library changes:

- Metric.php
  - resourceType added to EVENT_COLUMNS and GAUGE_COLUMNS so it round-
    trips through extractColumns() during write.
  - String column (size 64) added to getEventSchema() and
    getGaugeSchema() so new tables include it from the start.
  - getEventIndexes() and getGaugeIndexes() include resourceType with a
    `set(0)` data-skipping index (low cardinality — ~10-20 distinct
    values across a project's lifetime, so set beats bloom_filter for
    selective filters).
  - getResourceType() accessor.

- ClickHouse.php
  - EVENT_PROJECTIONS gains `p_by_resourceType` so grouped event
    aggregations on resourceType route to a sum-projection instead of
    scanning the base table.
  - GAUGE_PROJECTIONS gains `p_by_resourceType` and the combined
    `p_by_resourceType_resource` (latter covers the common
    "storage breakdown by resourceType × resource" panel in one
    projection scan).
  - ensureGaugeDimColumns() extended with resourceType; new
    ensureEventDimColumns() handles the same migration for existing
    events tables. Called from setup() — projections that reference
    resourceType cannot materialize until the source column exists on
    both base tables.

This is a backwards-compatible additive change: existing deployments
get the new column via ALTER TABLE ADD COLUMN IF NOT EXISTS at next
setup(); existing rows have NULL until publishers start populating the
tag (cloud will in a follow-up).
@lohanidamodar

Copy link
Copy Markdown
Contributor Author

Closing — resource already exists as a column on both events and gauges tables with values like 'bucket', 'function', 'site', 'deployment', 'build'. Adding a parallel resourceType (plural form: 'buckets', 'functions', 'sites') would store the same information twice. The cloud endpoints just need to expose the existing resource column as a dimension on the events endpoint (gauges already does). No library change needed.

@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown

Greptile Summary

This PR adds resourceType as a usage dimension for events and gauges. The main changes are:

  • Adds resourceType to metric column extraction, schemas, indexes, and the Metric accessor.
  • Adds ClickHouse projections for grouping by resourceType and resourceType × resource.
  • Extends setup migrations to add the new column to existing events and gauges tables.

Confidence Score: 3/5

The change needs fixes before merge because upgraded ClickHouse schemas and daily rollups do not fully support the new dimension, and the metric constant tests are stale.

The review covers both changed files and identifies several concrete compatibility gaps in migration, projection, daily aggregation, schema alignment, and tests.

src/Usage/Adapter/ClickHouse.php and tests/Usage/MetricTest.php

T-Rex T-Rex Logs

What T-Rex did

  • I reproduced the migration index test by running the generated PHP repro harness against a simulated legacy ClickHouse events/gauges schema, showing that index-resourceType is only in skipped fresh-create DDL while migrations add resourceType columns and projections, and post-setup the index-resourceType is absent.
  • I reproduced the materialization of new projections by using the emitted-SQL harness to reconstruct the SQL for existing events and gauges, and observed ADD PROJECTION IF NOT EXISTS for p_by_resourceType and p_by_resourceType_resource but no MATERIALIZE PROJECTION for historical parts.
  • Blocked: PHP runtime is not installed, so the daily rollups path could not be tested.
  • Blocked: targeted PHPUnit and dependency setup could not proceed because vendor/bin/phpunit and composer are not installed, and PHP is not available.
  • I verified the resourceType roundtrip by comparing before and after artifacts, showing that before had resourceType=false and after had resourceType=true along with has_getResourceType_method true, with the artifacts also including the full script, command, working directory, ClickHouse availability check, and exit code.

View all artifacts

T-Rex Ran code and verified through T-Rex

Comments Outside Diff (2)

  1. src/Usage/Adapter/ClickHouse.php, line 851-854 (link)

    P1 Materialize new projections

    ADD PROJECTION IF NOT EXISTS registers the projection, but it does not populate projection parts for rows already present in the table. After upgrading, historical resourceType data is left outside p_by_resourceType / p_by_resourceType_resource, so historical grouped reads can miss the intended projection path. The setup flow should materialize the new projections, or otherwise handle existing parts explicitly.

    Artifacts

    Repro: generated emitted-SQL harness

    • Contains supporting evidence from the run (text/x-python; charset=utf-8).

    Repro: harness execution output showing missing MATERIALIZE PROJECTION

    • Keeps the command output available without making the summary code-heavy.

    View artifacts

    T-Rex Ran code and verified through T-Rex

  2. src/Usage/Adapter/ClickHouse.php, line 1147-1154 (link)

    P2 Keep schema types aligned

    resourceType is added as LowCardinality(Nullable(String)) by the upgrade helpers, but it is not in the getColumnType() low-cardinality list used for fresh table creation. Fresh installs therefore create resourceType Nullable(String), while upgraded installs create LowCardinality(Nullable(String)). This makes the physical schema and query performance differ by install path.

Reviews (1): Last reviewed commit: "feat(schema): add resourceType dimension..." | Re-trigger Greptile

Comment on lines 795 to 818
$sql = "ALTER TABLE {$gaugesTable} "
. 'ADD COLUMN IF NOT EXISTS service LowCardinality(Nullable(String)), '
. 'ADD COLUMN IF NOT EXISTS resource LowCardinality(Nullable(String))';
. 'ADD COLUMN IF NOT EXISTS resource LowCardinality(Nullable(String)), '
. 'ADD COLUMN IF NOT EXISTS resourceType LowCardinality(Nullable(String))';

$this->query($sql);
}

/**
* Backfill late-added dim columns on an existing events table. Same
* reasoning as ensureGaugeDimColumns — CREATE TABLE IF NOT EXISTS won't
* pick up columns added to the schema after the table was first created,
* and a per-dim projection on `resourceType` cannot be materialized until
* the source column exists on the base table.
*/
private function ensureEventDimColumns(): void
{
$eventsTable = $this->escapeIdentifier($this->database)
. '.' . $this->escapeIdentifier($this->getEventsTableName());

$sql = "ALTER TABLE {$eventsTable} "
. 'ADD COLUMN IF NOT EXISTS resourceType LowCardinality(Nullable(String))';

$this->query($sql);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Migrate the indexes

This migration only adds the new resourceType columns. On existing ClickHouse tables, createTable() is skipped by CREATE TABLE IF NOT EXISTS, so the new index-resourceType definitions from Metric::get*Indexes() are never added. After an upgrade, WHERE resourceType = ... queries can still scan without the promised skip index, while fresh installs get different behavior.

Artifacts

Repro: generated migration harness

  • Contains supporting evidence from the run (text/x-php; charset=utf-8).

Repro: harness output showing missing resourceType index migration

  • Keeps the command output available without making the summary code-heavy.

View artifacts

T-Rex Ran code and verified through T-Rex

Comment on lines +815 to +816
$sql = "ALTER TABLE {$eventsTable} "
. 'ADD COLUMN IF NOT EXISTS resourceType LowCardinality(Nullable(String))';

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Update daily rollups

The events base table gets resourceType, but the daily events table, daily materialized view, DAILY_COLUMNS, and findDaily() grouping still omit it. Calls that use the daily APIs with a resourceType filter are rejected by validateDailyAttributeName(), and closed-day event sums filtered by resourceType cannot use the daily rollup because the router treats the column as not daily-safe. This leaves resourceType incomplete as a first-class events dimension for historical usage reads.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant